AITopics | test error

Collaborating Authors

test error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning

Neural Information Processing SystemsJun-23-2026, 03:22:58 GMT

Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLUCNN (strong). We consider structured data composed of labeldependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes--data-scarce and data-abundant--based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the datascarce regime, generalization occurs via benign overfitting or fails via harmful overfitting, depending on the amount of data, and we characterize the transition boundary. In the data-abundant regime, generalization emerges in the early phase through label correction, but we observe that overtraining can subsequently degrade performance.

artificial intelligence, inequality follow, machine learning, (19 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Education (0.45)
Information Technology (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

On the Mechanisms of Weak-to-Strong Generalization: ATheoretical Perspective

Neural Information Processing SystemsJun-21-2026, 21:20:04 GMT

Weak-to-strong generalization--where a student model trained on imperfect labels generated by a weaker teacher nonetheless surpasses that teacher--has been widely observed, but the mechanisms that enable it have remained poorly understood. In this paper, through a theoretical analysis of simple models, we uncover three core mechanisms that can drive this phenomenon. First, by analyzing ridge linear regression, we study the interplay between the teacher and student regularization parameters and prove that a student can compensate for a teacher's under-regularization and achieve lower test error. We also analyze the role of the parameterization regime of the models and show that qualitatively different phenomena can happen in different regimes. Second, by analyzing weighted ridge linear regression, we show that a student model with a regularization structure better aligned to the target function, can outperform its teacher. Third, in a nonlinear multi-index learning setting, we demonstrate that a student can learn easy, task-specific features from the teacher while leveraging its own broader pre-training to learn hard-to-learn features that the teacher cannot capture.

machine learning, natural language, weak-to-strong generalization, (14 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.67)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.54)

Add feedback

The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

Neural Information Processing SystemsJun-19-2026, 01:34:58 GMT

We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the ℓ2-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.

artificial intelligence, machine learning, regularization, (15 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Education (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Neural Information Processing SystemsJun-15-2026, 17:59:18 GMT

Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of nonequilibrium statistical physics. We show that, for large network width m, and large number of samples per input dimension n/d, the training dynamics exhibits a separation of timescales which implies: (i) The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; (ii) Inductive bias towards small complexity if the initialization has small enough complexity; (iii) A dynamical decoupling between feature learning and overfitting regimes; (iv)A non-monotone behavior of the test error, associated'feature unlearning' regime at large times.

artificial intelligence, deep learning, machine learning, (16 more...)

Neural Information Processing Systems

Country: Europe (0.45)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Add feedback

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Neural Information Processing SystemsJun-15-2026, 07:41:49 GMT

Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice. Unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both Adam and AdamW with proper weight decay λ converge to poor test error solutions, their mini-batch variants can achieve near-zero test error. We further prove Adam has a strictly smaller effective weight decay bound than AdamW, theoretically explaining why Adam requires more sensitive λtuning.

adamw, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

Lovig, Max

arXiv.org Machine LearningMay-1-2026

In modern parametric model training, full-batch gradient descent (and its variants) suffers due to progressively stronger biasing towards the exact realization of training data; this drives the systematic ``generalization gap'', where the train error becomes an unreliable proxy for test error. Existing approaches either argue this gap is benign through complex analysis or sacrifice data to a validation set. In contrast, we introduce decoupled descent (DD), a novel theory-based training algorithm that satisfies a train-test identity -- enforcing the train error to asymptotically track the test error for stylized Gaussian mixture models. Within this specific regime, leveraging approximate message passing theory, DD iteratively cancels the biases due to data reuse, rigorously demonstrating the feasibility of zero-cost validation and $100\%$ data utilization. Moreover, DD is governed by a low-dimensional state evolution recursion, rendering the dynamics of the algorithm transparent and tractable. We validate DD on XOR classification, yielding superior performance compared to GD; additionally, we implement noisy MNIST and non-linear probing of CIFAR-10, demonstrating that even when our stylized assumptions are relaxed, DD narrows the generalization gap compared to GD.

artificial intelligence, assumption, machine learning, (18 more...)

arXiv.org Machine Learning

2604.27883

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Yarin Gal, Zoubin Ghahramani

Neural Information Processing SystemsApr-30-2026, 20:26:52 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, dropout, machine learning, (18 more...)

Neural Information Processing Systems

Country: Europe (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Add feedback

DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets Supplementary Material Anonymous Author(s) Affiliation Address email

Neural Information Processing SystemsApr-30-2026, 00:06:07 GMT

Here we provide theoretical evidence that vanilla MoE do not6 guarantee convergence when mixing multiple datasets. Consider a binary classification problem over P-patch inputs where each8 patch has d dimensions and label y = { 1}. Thus, a labeled data point (x,y) has input x =9 (x(1),x(2),x(3),...,x(P)) (Rd)P is a collection of P patch inputs with y as the data label. The10 data x is generated from K clusters.11 Chen et al. [2022] proves that in such a binary-classification problem, an MoE layer converges to an12 o(1) test loss and zero training loss.

artificial intelligence, machine learning, mixture-of-dataset supplementary material anonymous author, (11 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

47a5feca4ce02883a5643e295c7ce6cd-Supplemental.pdf

Neural Information Processing SystemsApr-25-2026, 17:12:31 GMT

artificial intelligence, isotropic, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Filters

Collaborating Authors

test error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning

On the Mechanisms of Weak-to-Strong Generalization: ATheoretical Perspective

The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Decoupled Descent: Exact Test Error Tracking Via Approximate Message Passing

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

e4d3fe32495088805bbbb4f1de63e947-Paper-Conference.pdf

DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets Supplementary Material Anonymous Author(s) Affiliation Address email

47a5feca4ce02883a5643e295c7ce6cd-Supplemental.pdf